This guide walks you through setting up your own Large Language Model (LLM) server using Ollama on an Ubuntu VM with NVIDIA GPU passthrough in Proxmox.
What is This Setup?
This configuration allows you to:
- Run LLMs locally on your own hardware with GPU acceleration
- Host models like Llama, Mistral, or GPT-OSS for private AI inference
- Achieve faster response times compared to CPU-only inference
- Maintain data privacy by keeping everything on your infrastructure
- Access your LLM through a web UI similar to ChatGPT
What is GPU Passthrough?
GPU passthrough (also called PCIe passthrough) allows a virtual machine to directly access a physical GPU, bypassing the hypervisor layer. This means:
- Near-native performance: Your VM gets almost the same GPU performance as bare metal
- Direct hardware access: The VM controls the GPU as if it were physically installed
- Exclusive access: Only one VM can use the passed-through GPU at a time
- Required for GPU compute: Essential for running LLMs with GPU acceleration in VMs
Prerequisites
Before starting, you need:
- A Proxmox server with an NVIDIA GPU installed
- An Ubuntu Server VM (22.04 or later recommended)
- Docker installed on the VM
- Basic familiarity with Linux command line
- SSH access to your VM
- Sufficient VRAM on your GPU:
  - Small models (7B parameters): 8GB VRAM minimum
  - Medium models (13B-20B): 16GB+ VRAM
  - Large models (30B+): 24GB+ VRAM
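To see which tier a given GPU falls into, a small helper along these lines can read the total VRAM from nvidia-smi. This is only a sketch: vram_tier is an illustrative function name, the thresholds mirror the list above (set slightly below the nominal sizes, since cards usually report a little less than their marketed capacity), and nvidia-smi is only available once the NVIDIA driver is installed.

```shell
#!/usr/bin/env bash
# Map total VRAM (in MiB) to the model-size tiers listed above.
# vram_tier is a hypothetical helper name, not part of any tool.
vram_tier() {
  local mib="${1:-0}"
  if   [ "$mib" -ge 24000 ]; then echo "large (30B+)"
  elif [ "$mib" -ge 16000 ]; then echo "medium (13B-20B)"
  elif [ "$mib" -ge 8000 ];  then echo "small (7B)"
  else                            echo "below the 8GB minimum"
  fi
}

# Query the first GPU's total memory in MiB (requires the NVIDIA driver).
if command -v nvidia-smi >/dev/null 2>&1; then
  total_mib="$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)"
  echo "Detected ${total_mib:-0} MiB VRAM: $(vram_tier "$total_mib")"
else
  echo "nvidia-smi not found; install the NVIDIA driver first"
fi
```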
Step 1: Configure GPU Passthrough in Proxmox
Follow this video tutorial to set up GPU passthrough from your Proxmox host to your Ubuntu VM: 📹 Proxmox GPU Passthrough Guide
The video covers:
- Enabling IOMMU in BIOS
- Configuring Proxmox for PCIe passthrough
- Adding the GPU to your VM
- Verifying the setup
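As a rough companion to the video, the commands below sketch how passthrough is commonly attached and checked. The host-side commands are shown as comments because they must be run as root on the Proxmox host; VM ID 100 and PCI address 01:00.0 are placeholders for your own values, and gpu_visible is an illustrative helper name.

```shell
#!/usr/bin/env bash
# On the Proxmox host (VM ID 100 and address 01:00.0 are placeholders):
#   dmesg | grep -e DMAR -e IOMMU     # confirm IOMMU is enabled
#   lspci -nn | grep -i nvidia        # find the GPU's PCI address
#   qm set 100 -hostpci0 01:00.0      # attach the GPU to the VM

# Inside the Ubuntu VM: the GPU should now appear as a PCI device.
# gpu_visible is a hypothetical helper that scans lspci-style output on stdin.
gpu_visible() {
  grep -qi 'nvidia'
}

if command -v lspci >/dev/null 2>&1 && lspci | gpu_visible; then
  echo "NVIDIA GPU is visible inside the VM"
else
  echo "No NVIDIA device found; recheck the passthrough settings"
fi
```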
Step 2: Install NVIDIA Drivers
The NVIDIA drivers enable your Ubuntu system to communicate with the GPU hardware. Follow this guide for driver installation on Ubuntu: 📖 NVIDIA Driver Installation Guide
Quick verification after installation: run nvidia-smi, which should list your GPU along with the installed driver version.
Step 3: Install CUDA Toolkit
CUDA is NVIDIA’s parallel computing platform required for GPU-accelerated applications. Download and install CUDA from the official source: 📦 CUDA Toolkit Downloads
Select your operating system, architecture, and distribution to get the appropriate installation commands.
Step 4: NVIDIA Container Toolkit
Follow the official installation guide to enable GPU access in Docker containers: 📖 NVIDIA Container Toolkit Installation
Verify GPU access in Docker by running nvidia-smi inside a GPU-enabled container. You should see the same nvidia-smi output as before, confirming Docker can access the GPU.
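The checks for Steps 2 to 4 can be sketched as one short script. The Docker command follows the sample workload in NVIDIA's Container Toolkit documentation; the check helper is illustrative, nvcc may need /usr/local/cuda/bin on your PATH depending on how CUDA was installed, and the docker command needs a sudo prefix if your user is not in the docker group.

```shell
#!/usr/bin/env bash
# check is a hypothetical helper: run a command quietly and report PASS/FAIL.
check() {
  local label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $label"
  else
    echo "FAIL: $label"
    return 1
  fi
}

# Step 2: the driver can talk to the GPU.
check "NVIDIA driver (nvidia-smi)" nvidia-smi || true

# Step 3: the CUDA compiler is installed and on PATH.
check "CUDA toolkit (nvcc --version)" nvcc --version || true

# Step 4: a container can see the GPU through the NVIDIA runtime.
check "GPU inside Docker" \
  docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi || true
```

Any FAIL line points at the step to revisit before moving on.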
Step 5: Deploy Ollama and Open WebUI
Create a directory for your setup and, inside it, a docker-compose.yml file that defines the two services described below:
Ollama Service
- OLLAMA_KEEP_ALIVE: -1: Keeps models loaded in GPU memory indefinitely for instant responses
- Port 11434: API endpoint for model inference
- Volume: Persists downloaded models between restarts
- GPU reservation: Ensures the container can access all available GPUs
Open WebUI Service
- Port 80: Web interface accessible at http://your-vm-ip
- ENABLE_ADMIN_CHAT_ACCESS: false: Prevents the admin user from accessing other users' chats (I mean, it's kinda creepy to check your employees' chats)
- host.docker.internal: Allows the web UI to communicate with Ollama
- Volume: Stores user data, conversations, and settings
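A docker-compose.yml matching the description above might look like the following sketch. The directory name, container names, image tags, restart policy, and volume names are assumptions rather than the original file; the ports, environment variables, and GPU reservation are the ones listed above.

```shell
#!/usr/bin/env bash
# Directory name is an assumption; override with STACK_DIR=/path if preferred.
STACK_DIR="${STACK_DIR:-$HOME/ollama-stack}"
mkdir -p "$STACK_DIR" && cd "$STACK_DIR" || exit 1

# Write the compose file (the quoted 'EOF' keeps the content literal).
cat > docker-compose.yml <<'EOF'
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"              # API endpoint for model inference
    environment:
      OLLAMA_KEEP_ALIVE: "-1"      # keep models loaded indefinitely
    volumes:
      - ollama:/root/.ollama       # persist downloaded models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all           # expose all available GPUs
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "80:8080"                  # web UI at http://your-vm-ip
    environment:
      OLLAMA_BASE_URL: "http://host.docker.internal:11434"
      ENABLE_ADMIN_CHAT_ACCESS: "false"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - open-webui:/app/backend/data   # user data, chats, settings

volumes:
  ollama:
  open-webui:
EOF

echo "Wrote $STACK_DIR/docker-compose.yml"
```

From that directory, docker compose up -d starts both containers in the background.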
Step 6: Access Open WebUI and Download Models
Open your web browser and navigate to http://your-vm-ip, then:
- Click on your profile icon in the top right
- Go to Admin Panel → Settings → Models
- In the “Pull a model from Ollama.com” field, enter a model name
- Click the download button
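Models can also be pulled from the command line instead of the web UI. This sketch assumes the Ollama container is named ollama and uses llama3.1:8b purely as an example model name; substitute whatever model fits your VRAM.

```shell
#!/usr/bin/env bash
# Container and model names are assumptions; adjust to your setup.
OLLAMA_CONTAINER="${OLLAMA_CONTAINER:-ollama}"
MODEL="${MODEL:-llama3.1:8b}"

if command -v docker >/dev/null 2>&1; then
  # Download the model into the container's persistent volume.
  docker exec "$OLLAMA_CONTAINER" ollama pull "$MODEL" \
    || echo "Pull failed; is the $OLLAMA_CONTAINER container running?"
  # List the models that are now available locally.
  docker exec "$OLLAMA_CONTAINER" ollama list || true
else
  echo "docker not found; run this on the VM that hosts the containers"
fi
```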
Step 7: Verify GPU Acceleration
Check that your model is running on the GPU by running ollama ps while a model is loaded and reading these columns:
- PROCESSOR: 100% GPU ✅ - Model is running on GPU (good!)
- PROCESSOR: 100% CPU ❌ - Model fell back to CPU
- UNTIL: Forever ✅ - Model stays loaded (due to OLLAMA_KEEP_ALIVE: -1)
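The PROCESSOR and UNTIL columns come from ollama ps, which lists the models currently loaded. A sketch of the check, assuming the Ollama container is named ollama and a model has already been loaded by sending it a prompt (on_gpu is an illustrative helper name):

```shell
#!/usr/bin/env bash
# on_gpu is a hypothetical helper: succeeds if any line of `ollama ps`
# output (read from stdin) reports a model as fully on the GPU.
on_gpu() {
  grep -q '100% GPU'
}

OLLAMA_CONTAINER="${OLLAMA_CONTAINER:-ollama}"  # assumed container name

if command -v docker >/dev/null 2>&1; then
  ps_output="$(docker exec "$OLLAMA_CONTAINER" ollama ps 2>/dev/null)"
  echo "$ps_output"
  if echo "$ps_output" | on_gpu; then
    echo "Model is running on the GPU"
  else
    echo "No model reports 100% GPU; check the driver and container toolkit"
  fi
else
  echo "docker not found; run this on the VM that hosts the containers"
fi
```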
Step 8: Customize Model Context Length
The context window determines how much text the model can remember in a conversation. Larger contexts allow for longer discussions but use more VRAM.
Access Ollama’s interactive mode, set the num_ctx parameter to 10000, and save the result under a new model name. This:
- Sets context to 10,000 tokens
- Saves as a new model variant with the custom context
- The new model persists these settings permanently
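In interactive mode this is typically done with /set parameter num_ctx 10000 followed by /save and a new model name. A non-interactive alternative, sketched here, writes the same setting into a Modelfile and builds the variant with ollama create. The base model llama3.1:8b, the variant name llama3.1-10k, and the container name ollama are all assumptions.

```shell
#!/usr/bin/env bash
BASE_MODEL="${BASE_MODEL:-llama3.1:8b}"        # assumed base model
NEW_MODEL="${NEW_MODEL:-llama3.1-10k}"         # assumed variant name
OLLAMA_CONTAINER="${OLLAMA_CONTAINER:-ollama}" # assumed container name

# Write a Modelfile that inherits from the base model and raises num_ctx.
MODELFILE="$(mktemp)"
cat > "$MODELFILE" <<EOF
FROM $BASE_MODEL
PARAMETER num_ctx 10000
EOF

if command -v docker >/dev/null 2>&1; then
  # Copy the Modelfile into the container and build the new variant from it.
  docker cp "$MODELFILE" "$OLLAMA_CONTAINER:/tmp/Modelfile" \
    && docker exec "$OLLAMA_CONTAINER" ollama create "$NEW_MODEL" -f /tmp/Modelfile \
    || echo "Could not create $NEW_MODEL; is the container running?"
else
  echo "docker not found; Modelfile left at $MODELFILE"
fi
```

The Modelfile route has the advantage of being repeatable: the same two lines can be kept in version control and reapplied after a rebuild.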
Check VRAM usage with nvidia-smi after changing context length.
Verify your custom model:
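One way to do this is ollama show, which prints a model's details, including its parameters. The sketch below assumes the container is named ollama, that the custom variant was saved as llama3.1-10k (a hypothetical name), and that the parameter listing prints one name and value per line; ctx_from_show is an illustrative helper.

```shell
#!/usr/bin/env bash
# ctx_from_show is a hypothetical helper: extract the num_ctx value from
# `ollama show --parameters` output read on stdin.
ctx_from_show() {
  awk '$1 == "num_ctx" { print $2 }'
}

OLLAMA_CONTAINER="${OLLAMA_CONTAINER:-ollama}" # assumed container name
NEW_MODEL="${NEW_MODEL:-llama3.1-10k}"         # assumed variant name

if command -v docker >/dev/null 2>&1; then
  ctx="$(docker exec "$OLLAMA_CONTAINER" ollama show "$NEW_MODEL" --parameters 2>/dev/null | ctx_from_show)"
  echo "num_ctx for $NEW_MODEL: ${ctx:-not set or model not found}"
else
  echo "docker not found; run this on the VM that hosts the containers"
fi
```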